Post-Processing of MCMC

Markov chain Monte Carlo is the engine of modern Bayesian statistics, being used to approximate the posterior and derived quantities of interest. Despite this, the issue of how the output from a Markov chain is post-processed and reported is often overlooked. Convergence diagnostics can be used to control bias via burn-in removal, but these do not account for (common) situations where a limited computational budget engenders a bias-variance trade-off. The aim of this article is to review state-of-the-art techniques for post-processing Markov chain output. Our review covers methods based on discrepancy minimisation, which directly address the bias-variance trade-off, as well as general-purpose control variate methods for approximating expected quantities of interest.


INTRODUCTION
The Bayesian statistical framework is operational, in the sense that a user first elicits their a priori belief and then updates their belief in light of data, in a way that is (at least in principle) uniquely prescribed.This updating is codified by Bayes' rule, which expresses parameters posterior probability density as being proportional to the product of a priori probability density and the data likelihood.Certain combinations of a priori belief and likelihood are conjugate, meaning that the posterior can be analytically computed.Outside of the conjugate setting, computational methods are required.The computational challenge, then, is to accurately approximate an intractable probability distribution, meaning a distribution whose density function is available up to proportionality, where the normalisation constant is an intractable integral.The majority of Bayesian analyses produce a posterior that is intractable, as indeed do other statistical frameworks (such as generalised Bayesian inference; Bissiri et al. 2016).There has, accordingly, been extensive research into computational methods for approximating intractable distributions.The focus of this review is on Markov chain Monte Carlo (MCMC) methods, a large class of computational methods which, for several decades now, have been considered among the state-of-the-art.Given an intractable distribution, one can typically find several methods in the MCMC literature that can be applied.However, the effectiveness of a particular method is not always easy to predict.Furthermore, once an MCMC method has been applied, it is not always easy to determine the quality of the approximation produced.Typical situations where these challenges occur include applications of Bayesian statistics in which the parameter space is high-dimensional, the likelihood has high information content, causing the posterior to present multiple modes or concentrate on manifolds, and settings where computational complexity limits the number of evaluations of the likelihood (Brooks et al. 2011).
Post-processing procedures aim to improve the quality of estimators that are based on MCMC output, either to approximate the probability distribution itself or a derived quantity of interest.The main practical requirement of a post-processing procedure is that it should be agnostic to the details of the MCMC method.The best known examples of post-processing procedures are burn-in removal and thinning.In burn-in removal, one attempts to identify a number of iterations after which the Markov chain can (informally) be said to have converged to the parameters posterior distribution, and then removes the initial part of the output where the chain had not converged.This procedure can reduce bias in the MCMC output by reducing the dependence on how the MCMC was initialised, but it does not consider variance of the resulting estimators, which depends on the sample size and thus may be large if most of the chain is removed.In thinning, every kth iteration is retained and the remainder discarded, in order to reduce the positive correlation between the remaining states, and therefore reduce the asymptotic variance of the estimators.This can facilitate compression of MCMC output but does not always improve the approximation that is produced (notoriously, thinning does not lead to an efficiency gain if the samples are only used to estimate the posterior expectation of an inexpensive function).
It is thus notable that post-processing of MCMC engenders a bias-variance trade-off and yet standard post-processing procedures do not attempt to address this trade-off.
This review focuses on modern post-processing techniques that can be applied to MCMC output.Our discussion focuses on MCMC, for which consistency results have been established, but much of what we discuss is amenable to application in other computational methods that produce a collection of representative values as output, such as sequential Monte Carlo (Chopin 2002).To limit scope, our focus is principally on continuous-valued random variables, rather than discrete or categorical variables, but where possible we aim to keep discussion general.The paper is structured as follows: notation is established in Section 1.1, background on Markov chains is recalled in Section 1.2, and a formal problem statement is provided in Section 1.3.Section 2 focuses on the task of approximating the full probability distribution using MCMC output; we recall the standard approaches of burn-in removal and thinning, before describing modern and powerful techniques based on discrepancy minimisation in detail.In Section 3 we focus on the task of approximating one or more scalar quantities of interest.Control variate methods represent a powerful computational tool in this context, and we discuss the state-of-the-art in control variate methodology in detail.A brief discussion concludes in Section 4.

Notation
For this paper we use (Ω, F, P) to denote an underlying probability space on which all random variables are (often implicitly) defined and we let E[ • ] = • dP.Conditional probabilities are defined in the standard sense of Kolmogorov (1956) and denoted P(F |G), F, G ∈ F. For this paper we introduce a measurable space (X , B) and consider a random variable to be a measurable function X : Ω → X , whose distribution P is defined as P (B) := P(X ∈ B) for all B ∈ B, where the conventional shorthand "X ∈ B" is used for the event {ω ∈ Ω : X(ω) ∈ B} ∈ F .In the Bayesian context, X represents the parameters of a statistical model and P represents the posterior distribution after data have been assimilated.Let L 2 (P ) be the vector space of random variables f : X → R with f 2 dP < ∞.Let δ(x) denote the distribution of the random variable f (ω) = x for all ω ∈ Ω.For a differentiable function f : R d → R, we denote the gradient of f as ∇f where (∇f )(x) := (∂x 1 f (x), . . ., ∂x d f (x)) .For a differentiable function f : R d → R d , we denote the divergence of f as ∇ • f where (∇ • f )(x) := ∂x 1 f1(x) + • • • + ∂x d f d (x), and if f is twice-differentiable we denote the Laplacian of f as ∆f where ∆f := ∇ • (∇f ).Natural numbers excluding zero are denoted N and including zero are denoted N0.The vector of ones is denoted 1, the unit vector (1, 0, . . ., 0) is denoted e1 and x denotes the Euclidean distance

Markov Chains
A Markov chain is a sequence (Xn) n∈N of random variables Xn : Ω → X with the property that Xn+1 ⊥ ⊥ (Xm)m<n | Xn, where X ⊥ ⊥ Y | Z indicates that the random variables X and Y are conditionally independent given the random variable Z.

In this paper we assume
Markov chain: An ordered sequence of random variables Xn, such that X n+1 is conditionally independent of (Xm)m<n given Xn.
a non-random initial state X0 ∈ X .To a Markov chain we can associate a sequence of transition kernels Pn(x, B) represents the probability that the state Xn of the Markov chain takes a value in the set B, given that the previous state Xn−1 was equal to x.The chain is said to be timehomogeneous if Pn does not depend on n.Inductively define the nth step transition kernel as P n (x, B) := Pn(y, B)P n−1 (x, dy), x ∈ X , B ∈ B, with base case P 0 (x, B) = 1 if x ∈ B, and 0 if x / ∈ B. That is, P n (x, B) represents the probability that the state Xn of the Markov chain takes a value in the set B, given that the initial state X0 was equal to x.A Markov chain is said to be P -invariant if P n (x, B)dP (x) = P (B) for all n and all B ∈ B. Intuitively, if one was to randomise the initial state X0 by sampling it from P , then the state Xn will also have distribution P if the Markov chain is P -invariant.
Loosely speaking, a P -invariant Markov chain might be described as ergodic if P n (x, B) approximates P (B) in the n → ∞ limit, for all x ∈ X , B ∈ B. Several notions of ergodicity exist in the literature, but in this paper we focus on a specific notion called V -uniform ergodicity, which will now be defined.For a function V : ρ n for all n ∈ N and all x ∈ X .A comprehensive treatment of Markov chains can be found in the textbook of Meyn & Tweedie (2012).

Problem Statement
Consider an intractable probability distribution P .Our aim is to compute an approximation, either to the distribution P itself (Section 2) or to derived scalar quantities of interest (Section 3).Our starting point is one1 realisation (i.e. based on one random seed ω ∼ P), of a finite portion (Xn) n≤N of the Markov chain2 , which we call the MCMC output.It is not assumed that the Markov chain is P -invariant unless stated, and later we discuss how output from a MCMC method that is Q-invariant may nevertheless enable P to be consistently approximated if Q is not too dissimilar to P .All approximations are to be constructed by post-processing the MCMC output.In other words, we may only consider properties of P defined locally at the states Xn and no further exploration of X outside this finite set is permitted.In particular, we exclude the trivial solutions of simply running further iterations of MCMC or adopting a different, possibly better MCMC method.This set-up is realistic, reflecting the scenario that a practitioner has invested considerable resources into producing MCMC output and wishes to employ post-processing techniques to extract as much value as possible from their investment.
MCMC output: A single realisation (or "sample path") of a Markov chain, of which the first N states are provided.
An important preliminary comment is that the post-processing techniques described in this article (and indeed most MCMC methods) are not parametrisation invariant.This means that, if one were to apply an invertible transformation Yn = y(Xn), then postprocessing of the MCMC output (Yn) n≤N can lead to different conclusions compared to if (Xn) n≤N had been post-processed.To limit scope we do not discuss parametrisation in this article.Instead, following standard practice, we presuppose that one has employed transformation(s), such as centering and scaling (Yu & Meng 2011), that (loosely speaking) promote simplicity, in order that P can be more easily approximated.

APPROXIMATION OF THE POSTERIOR DISTRIBUTION
The outcome of an exploratory Bayesian analysis is the posterior distribution itself, expressing a posteriori belief about unknown parameters on the basis of a priori belief and evidence provided by the dataset.To facilitate exploratory Bayesian analysis outside the conjugate setting, it is therefore important that the entire posterior distribution can be accurately approximated.This section studies how MCMC output can be used to produce an approximation to a distribution P of interest.Throughout we consider approximations of the form where w1, . . ., wM ∈ R are weights satisfying M i=1 wi = 1 and π : {1, . . ., M } → {1, . . ., N } is a function that indicates which states from the MCMC output are included in Equation (1).In simple terms, this approximation extracts and re-weights a subsequence of length M from the given MCMC output of length N .
Post-processing MCMC output: Selecting a weighted combination of states from the MCMC output to better represent the posterior distribution P .
Recall the two categories of post-processing discussed in Section 1. First, if the chain is constructed so that its asymptotic law converges to P , then excluding the first b points (the burn-in period) from Equation (1) may help to reduce bias due to the choice of the initial state X0 of the Markov chain.This corresponds to excluding {1, . . ., b} from the image of π, and we discuss standard approaches to this problem in Section 2.1.Second, thinning of MCMC output can be useful when samples are to be used for further computation, especially when the subsequent computation has a high cost.This corresponds to excluding i from the image of π whenever i = 1 modulo k, and we briefly discuss approaches to thinning in Section 2.3.In both cases, uniform weights wi = 1 M are assumed in Equation (1).

Burn-in Removal
Burn-in: The first b states of a P -invariant Markov chain, for which the distribution of Xn, n ≤ b, is deemed to substantially differ to P .
In this section we discuss standard approaches to identification of a burn-in period from given MCMC output, in order to control the bias resulting from an arbitrary choice of initial state X0 for the Markov chain.Our focus is limited to continuous domains X ⊆ R d .Rigorous approaches for selecting a burn-in period b have been proposed by authors including Meyn & Tweedie (1994), Rosenthal (1995), Roberts & Tweedie (1999); see also Jones & Hobert (2001).Unfortunately, these often involve conditions that are difficult to establish or, when they hold, they may provide loose bounds on the total variation distance between the law of the Markov chain and the invariant distribution, implying an unreasonably long burn-in period.More recently Biswas et al. (2019) discuss how to estimate such bounds through coupling and multiple MCMC runs, but this is out of the scope considered here, where a single MCMC run has been obtained at moderate to high computing cost.Convergence diagnostics have emerged as a practical solution to the need to test for non-convergence of MCMC.Their use is limited to reducing bias in MCMC output; they are not designed for the setting that we consider, where the length N of the MCMC output is fixed, and which requires a bias-variance trade-off.Nevertheless, convergence diagnostics constitute the most common means by which MCMC output is post-processed in modern software packages for MCMC, including WinBUGS (Lunn et al. 2000), JAGS (Plummer 2003), R (R Core Team 2020), Stan (Carpenter et al. 2017), andPyMC3 (Salvatier et al. 2016).
In this section we recall standard practice for selection of a burn-in period b, and thus (implicitly) in constructing an estimator of the form Equation (1), focussing on the traditional R statistic of Gelman & Rubin (1992), Brooks & Gelman (1998).The aim of this Section is to describe the general idea and fundamental limitations of convergence diagnostics in the fixed N scenario, rather than presenting the state-of-the-art or providing a comprehensive survey of convergence diagnostics for burn-in removal.We simply recall that the R convergence diagnostic was first introduced in Gelman & Rubin (1992) and subsequently corrected in Brooks & Gelman (1998), and this was then simplified in Gelman et al. (2003).We use the implementation of the Brooks & Gelman (1998) version from the R package coda in our experiments and we focus on the simple expression of Gelman et al. (2003) in the text.Further developments of the R convergence diagnostic include Gelman et al. (2013), where the diagnostic test is performed separately on each half of the MCMC output; Vats & Knudson (2018), that revisits a connection between R and effective sample size (ESS) of quantities of interest estimated from the MCMC output; 3 Vehtari et al.  (2021), that provides more details on such connections, addressing also target distributions with infinite variance, and the case in which the Markov chain is exploring the bulk of the target distribution, but not its tails.A comprehensive survey of convergence diagnostics for MCMC can be found in Roy (2020).
The traditional R statistic of Gelman et al. (2003) is not a post-processing method in the strict sense set out in Section 1.3, because it is based on l = 1, . . ., L independent realisations of MCMC output, (X l n ) n≤N ; i.e.X l n denotes the random variable Xn evaluated at ω l , where ω1, . . ., ωL ∼ P are independent.For a uni-dimensional target distribution, the traditional R statistic is defined as the square root of the ratio of two estimators of the variance σ 2 of the target: where s 2 is the (arithmetic) mean of the sample variances s 2 l along the L sample paths (X l n ) n≤N , and it typically provides an underestimate of σ 2 ; while σ2 is constructed as an unbiased overestimate of the target variance where m l is the sample mean from the lth sample path, and where the second term is the sample variance of the sample means from L chains.
For an ergodic Markov chain, R

Sample mean and variance:
The sample mean m l and variance s 2 l of a MCMC output (X l n ) n≤N are defined, respectively, as 1 N X l n and , where the sums run over n = 1, . . ., N .
converges to 1 as N → ∞.In practice, it is common to discard a burn-in period of length b = N , where N is the smallest integer for which R < 1 + δ, and δ is a suitable threshold4 .The somewhat arbitrary choice of δ = 0.1 has historically been used (Gelman et al. 2013), and current best practice for traditional R and its extensions advocates δ = 0.01 (Vehtari et al. 2021).
Convergence diagnostics can help to detect situations in which a Markov chain has not converged, and for this purpose they are widely used.Their main drawbacks are (a) such diagnostics do not provide guarantees that the Markov chain has actually converged (existing convergence diagnostics can assess only necessary but not sufficient conditions for convergence); (b) burn-in removal may not be useful in practical settings where the MCMC output has already been obtained and post-processing is required, as described in Section 1.3.In order to mitigate the first point, Vats & Knudson (2018), Vehtari et al. (2021) recommend to look at the effective sample size of quantities of interest, if possible combining autocorrelation information from multiple chains, which helps detecting poor convergence in cases of multimodal target distributions.However, this still remains only a necessary, not sufficient condition for convergence, and it does not help tackling the second point.This section ends with an example to highlight this important second drawback of convergence diagnostics: Example (Burn-in removal lacks a bias-variance trade-off).The purpose of convergence diagnostics is to detect and avoid bias due to dependence on the arbitrary choice of initial state X0.However, burn-in removal does not address the bias-variance trade-off that occurs when the MCMC output is fixed.As an extreme illustration of this, consider the MCMC output shown in Figure 1.Here L = 6 independent sample paths of total length N = 10 3 were produced using random walk Metropolis-Hastings (Metropolis et al. 1953).A simple bivariate target P , whose contour lines are plotted in red, was used, but the Markov chain was not optimised, to simulate a challenging sampling context.The initial states X l 0 were over-dispersed relative to the target P , requiring the Markov chain to take several steps before the high probability region is reached.Figure 2 applies convergence diagnostics to establish whether or not the Markov chains can be said to have converged.
The traditional R statistic of Brooks & Gelman (1998) (black solid line) detects nonconvergence even after all N = 10 3 iterations of the MCMC output have been considered, irrespective of whether the diagnostics are applied to each coordinate of the state vector or jointly to both coordinates.This is undesirable from our perspective of post-processing MCMC output, since it is clear from Figure 1 that there is useful information in the MCMC output, even if some dependence on X0 can be detected.In addition, we present in Figure 2 two of many proposed improvements over Brooks & Gelman (1998): the recent diagnostic due to Vats & Knudson (2018) (blue solid line), and also a version of such convergence diagnostic (red solid lines), presented in the same work, and that can be computed using a single MCMC output.They indicate that the burn-in period has finished, but they leave only a small portion of the chain after the burn-in, when considering the threshold δ = 0.01.All convergence diagnostics were computed using the R packages coda (Plummer et al. 2006) and stableGR (Knudson & Vats 2020).
The modern MCMC post-processing techniques presented in Section 2.3 address this bias-variance trade-off, and their use is encouraged in problems where obtaining further MCMC iterations is not practical.

Fixed Frequency Thinning
As with the classical approaches to burn-in removal discussed in Section 2.1, thinning is often performed on a heuristic basis as the simplest way to achieve compression of MCMC output.In exploratory Bayesian analysis, this is often motivated by the need to reduce storage cost or to make subsequent computation faster (the reader can, for example, think about the case in which the samples obtained from the posterior are used for forward uncertainty propagation in complex multi-scale models).However, thinning is traditionally performed with no specific aim to improve the accuracy of the MCMC output.
Figure 1: MCMC output.Here we show L = 6 independent realisations of MCMC output (gray lines), for a particular bivariate distributional target P indicated by the shaded contour plot in the background.In each case a total of N = 10 3 iterations of the Markov chain were performed, with the first 500 iterations plotted.2018) (VR; blue and red solid lines), each as a function of the number of iterations N of the Markov chain that are considered.These diagnostics were applied separately to the first and second coordinates of the bivariate state variable (left and central panels) and jointly to both coordinates (right panel).Dashed lines indicate thresholds at which convergence is deemed to have occurred.In all cases, the traditional R statistic does not fall below the thresholds, indicating that convergence has not occurred.
The most common approach to thinning is to sub-sample with a fixed frequency from the chain ('retain every kth sample and discard the rest'), which can be an effective strategy to reduce auto-correlation in the MCMC output.Systematic approaches for determining k do exist, and the most well-known is based on the auto-correlation estimator of Geyer (1992), that can be computed using the R package LaplacesDemon (Statisticat & LLC. 2021).This method estimates a sequence of fixed-lag auto-correlations in the Markov chain, and then thresholds this sequence to give a k that results in a subsample that is close to uncorrelated.This procedure is most useful in exploratory Bayesian analysis, where a set of such samples are themselves required, rather than as an attempt to improve an estimator.See Owen (2017) for a discussion on the statistical efficiency of this approach.
More sophisticated approaches to compress MCMC have been explored by authors including Paige et al. (2016) and Mak & Joseph (2018).In both cases, the authors aimed to construct an approximation of the form in Equation ( 1) with M N , such that Equation (1) provides an accurate approximation to the discrete distribution 1 N N n=1 δ(Xn) supported on the original MCMC output.Although these can provide effective compression of MCMC output, if the Markov chain has not converged then the compressed output will be biased.In the next section, we discuss an approach that is simultaneously capable of thinning and de-biasing MCMC output, and is applicable even in cases where the Markov chain is not P -invariant.

Discrepancy Minimisation
Here we discuss modern and powerful approaches to post-processing of MCMC that aim to directly address the bias-variance trade-off just described.The approach we will explore casts the choice of π in Equation ( 1) as an optimisation problem.The key idea is to identify an appropriate quantification of the discrepancy between the discrete distribution QM in Equation ( 1) and the distributional target P , and then to select both the weights wi and the index sequence π so that this discrepancy is minimised.A discrepancy is defined as a bivariate function D such that D(P, P ) = 0 for all distributions P , and D(P, Q) > 0 for all P = Q.There are an infinitude of functions D that satisfy these relations, so for a discrepancy to be useful we typically require several other properties.An important property, which we will not discuss further in this paper since it is often satisfied, is that D(P, QM ) → 0 whenever QM converges to P in an appropriate sense.Another important property, which is the converse of the property just described and which we will discuss, is called convergence control, where D(P, QM ) → 0 implies that QM converges to P in a sense that must be specified.
Discrepancy: A discrepancy D is a non-negative function where D(P, Q) is interpreted as the dissimilarity between measures P and Q.

Convergence control:
As an aside, we note that many related strands of work exploit discrepancy to approximate a distributional target P by a discrete distribution Q N .For example, in quasi Monte Carlo an explicit construction Q N = 1  2019).In both of these methods the goal is to find a compressed representation of P , and as a starting point it is assumed that P is known in full.This is not the case when one has to post-process MCMC output.An approach considered by Kyriazopoulou-Panagiotopoulou et al. (2008), Kontoyiannis & Meyn (2008) is to adjust estimates based on the discrepancy between the expectation under P and the expectation under Q N of a reference function, although this discrepancy is not convergence-determining for a finite number of reference functions.To overcome these problems a specialised family of discrepancies will be required.These are introduced next.Here a Markov chain sample path (gray line) is post-processed to select M = 12 representative states (black circles), such that the discrete measure supported on these representative states provides an accurate approximation to the same distributional target P as in Figure 1, indicated by the shaded contour plot in the background.

Stein Discrepancy Minimisation.
Our aim is to select an appropriate discrepancy D for post-processing of MCMC.To this end, we focus on Stein discrepancy, and in particular a kernel Stein discrepancy constructed for the case where the domain X is R d ; see the three inset boxes for definitions and detail.The main computational requirement when using Stein discrepancy is the evaluation of the gradient ∇ log p along the MCMC sample path, where p is a density function for P .Note that gradient-based samplers, such as the Metropolis-adjusted Langevin algorithm (Roberts & Stramer 2002) or Hamiltonian Monte Carlo (Duane et al. 1987), produce the required evaluations as a by-product.Stein discrepancy is particularly well-suited to post-processing of such MCMC output since, under appropriate technical assumptions, it (a) allows explicit computation of D(P, QM ), and (b) provides convergence control, meaning in this context that D(P, QM ) → 0 implies QM converges weakly to P .
Stein Discrepancy: A discrepancy D such that D(P, Q) can be computed when P is an intractable distribution and Q has a finite support.
The optimisation problem we are interested in thus reduces to the problem of identifying weights wi and an index sequence π for which the kernel Stein discrepancy D(P, QM ) is minimised when QM is the discrete distribution in Equation ( 1  Combinatorial optimisation to elicit an index sequence π for which the kernel Stein discrepancy D(P, QM ) is minimised presents some technical challenges, which we defer discussion of until Section 2.3.2.Here we describe the simple, sequential approach called Stein thinning that was explored in Riabiz et al. (2021).This involves constructing π in a sequential, greedy manner, in which at iteration 1 ≤ j ≤ M , an index π(j) is selected

STEIN DISCREPANCY
A Stein discrepancy is a discrepancy of the form where FP is a set of functions chosen to satisfy f dP = 0.For a sufficiently large set FP it is possible to have D(P, Q) = 0 imply P = Q.One way of achieving this is by taking FP to be the set of functions of the form f (x) = h(x) − hdP , with h ranging over a measure-determining set H (such a D is recognised as an integral probability metric; see Müller 1997).However, for intractable P the presence of the integral hdP renders this choice impractical.Building on Stein (1972), the recent work of Gorham & Mackey (2015) proposed an alternative approach that can be applied provided that P admits a positive and differentiable density on X = R d .Let the set FP be composed of functions of the form where or, equivalently, using the explicit form of kernel Stein discrepancy in Equation ( 6) of the inset box, This procedure has computational complexity O(N M 2 ), or possibly less (since it is possible for a state to be repeatedly selected and the relevant quantities to be cached).
The main conceptual advantages of Stein thinning and related algorithms, compared to the standard post-processing techniques described in Section 2.1 and Section 2.2, are that (a) they directly address the bias-variance trade-off, (b) they can correct for systematic bias in the MCMC output, (c) they can automatically identify and remove a burn-in period.The main practical limitation of Stein thinning and related algorithms is that there are certain pathologies of Stein discrepancy, which occur when either (a) P has distant highprobability regions, or (b) P is high-dimensional (e.g.d > 100), either of which can lead to poor approximations when M is small; see Wenliang (2020).To illustrate the potential advantages of Stein thinning, we now present a special case of Theorem 3 in Riabiz et al.

TAIL CONDITION FOR STEIN DISCREPANCY
To construct a Stein discrepancy we require a set of functions h : R d → R d for which the Stein identity (AP h)dP = 0 holds, with AP defined in Equation ( 3).This can be formulated as a tail condition on h.The main idea is to recognise AP h as a divergence operator and exploit the divergence theorem over a ball B(r) of radius r > 0, centred at the origin in R d : Here dV denotes the volume element in B(r), dσ denotes the surface area element on the boundary of B(r), and n denotes the unit normal to the boundary of B(r).In order for this final term to vanish, it is sufficient that ph • n vanishes uniformly with respect to the surface area B(r) dσ, which is O(r d ).Thus, if h : R d → R d and log p : R d → R are both continuously differentiable and the tail condition is satisfied for some C ∈ R, some δ > d − 1, and all x ∈ R d outside of a bounded set, then the Stein identity is satisfied (South et al. 2021).
(2021), which describes conditions under which the sequence generated using the Stein thinning algorithm in Equation ( 8) produces a discrete approximation QM that converges almost surely to P .Note in particular that the result does not assume that the Markov chain is P -invariant.
Theorem (Bias correction for MCMC).Let P , P be probability distributions with positive and continuous densities p and p on R d .Assume that the tails of P are distantly dissipative (a relaxation of log concavity; see Gorham et al. 2019) and that p is continuously differentiable on R d .Consider a P -invariant, time-homogeneous Markov chain (Xi) i∈N generated using a V -uniformly ergodic transition kernel, where V (x) = p(x) p (x) d + ∇ log p 2 .Suppose that, for some γ > 0, the following moment condition is satisfied: Let π be an index sequence of length m produced by Equation (8) applied to the MCMC output (Xn) n≤N , where kP in Equation ( 5) is based on the inverse multi-quadric kernel This result, and the related results in Liu & Lee (2017), Hodgkinson et al. (2020), weaken or remove the requirement to design Markov chains that are exactly P -invariant.See also Gramacy et al. (2010), Radivojević & Akhmatskaya (2020).Less formally, this result suggests that one may not need to run a Markov chain to convergence in order for its output to be useful.On the other hand, the moment condition in Equation ( 9) imposes a requirement that P cannot be too dissimilar to P (informally, the Markov chain must

KERNEL STEIN DISCREPANCY
To facilitate computation of the supremum in Equation ( 2), one can specialise to a particular form of Stein discrepancy called kernel Stein discrepancy.A kernel is a symmetric, positive-definite function k : X × X → R. A kernel k reproduces a Hilbert space, denoted H(k), whose inner product is denoted •, • H(k) .This means that the elements of H(k) are functions f : X → R, and it holds that (i) k(•, x) ∈ H(k) for all x ∈ X , and (ii) f, k(•, x) H(k) = f (x) for all x ∈ X , f ∈ H(k).For example, the Gaussian kernel k(x, y) = exp(−(x−y) 2 ) reproduces a Hilbert space that contains functions of the form f (x) = m i=1 wi exp(−(x−yi) 2 ) for all wi ∈ R, yi ∈ R, m ∈ N, as well as certain limits of such functions (Berlinet & Thomas-Agnan 2011).
The main observation here is that if we take the set for all wi ∈ R, xi ∈ R d , n ∈ N. The kernel Stein discrepancy in Equation ( 6) can therefore be exactly computed whenever the gradient ∇ log p can be evaluated.Furthermore, under certain conditions, the kernel Stein discrepancy provides weak convergence control (Gorham & Mackey 2017, Theorem 8).
explore the high density regions of P , albeit not necessarily with the same frequencies as would be expected if the chain was P -invariant).Being a recent line of research, it remains to be seen whether these advances in the post-processing of MCMC output will in turn influence the design of algorithms for MCMC.Software to perform Stein thinning, including packages for R, Python and MATLAB, is available at stein-thinning.org.

Extensions to Stein Discrepancy
Minimisation.The Stein thinning algorithm that we just described in Equation ( 8) is myopic, in that it selects the index of the single best state π(j) at each iteration without consideration of whether this makes subsequent choices better or worse overall.This myopia can make the optimisation statistically inefficient, as observed in the left panel of Figure 4. Specifically, we see that the choice of the first state as the sample closest to the global mode of the distribution means that all possible choices for the second state will (temporarily) significantly worsen the overall approximation.In a less extreme fashion, this can also be seen for state 6, and state 8, by similar symmetry observations.A second shortcoming of the algorithm we just presented is that it requires scanning through the entire MCMC output of size N at each iteration, which can lead to an unacceptable computational cost.
Non-myopic: An optimisation algorithm is non-myopic if it looks further than a single step ahead when deciding the best course of action at a given iteration.To ameliorate these shortcomings, at least to an extent, the Stein thinning algorithm can be generalised to both non-myopic and mini-batch settings as described in Teymur et al. (2021).The first of these extensions involves selecting multiple points simultaneously, while the mini-batch extension considers, at each iteration, selecting points from a random subset of the samples along the MCMC path.It was shown in Teymur et al. ( 2021) that these two extensions are synergistic, in that non-myopic optimisation is most useful in the mini-batch context.These extensions of Stein thinning will now be described.Let B N be a mini-batch size and let (X j b ) 1≤b≤B,1≤j≤M be the collection of mini-batches, each of size B and to be used at iteration j.For example, the mini-batches could be chosen uniformly, with or without replacement, from the MCMC output (Xn) n≤N .Let s ∈ N be the lookahead horizon, meaning the number of points to be simultaneously selected (the algorithm in Equation (8) corresponds to s = 1).Then we choose a vector π(j, •) of s indices to be used at iteration j by performing the optimisation which we have obtained using the explicit form of kernel Stein discrepancy in the inset box, in a similar manner to how we obtained Equation (8).Run for M iterations, this algorithm selects a representative set of sM states, potentially with some states selected more than once.It is possible to apply a similar theoretical analysis to that in Section 2.3.1 to the generalised algorithm in Equation ( 10 Implementation of this non-myopic algorithm requires solving the optimisation problem in Equation ( 10).This is a potentially challenging combinatorial optimisation problem, and is only tractable when the batch size B is small (e.g.B ≤ 1000).In order to solve it, one can represent the indices S ⊂ {1, . . ., B} s of the s points to be selected at iteration j as a vector v j ∈ N s 0 := {0, . . ., s} B whose ith element indicates the number of copies of X j i that are selected from the jth mini-batch, where v j is constrained to satisfy B i=1 v j i = s.It is then an algebraic exercise to recast an optimal index sequence π(j, •) as the solution to a constrained integer quadratic programme (e.g.Wolsey 2020): kP (X π(i ,j ) , Xj).

11.
Integer quadratic programme: An optimisation problem in which the objective function is quadratic, and where the solutions are constrained to be integer-valued.
Depending on the values of B, M and N , and the way in which the mini-batches are selected, it may be advantageous to store and reuse kernel calculations from iteration to iteration.In general, however, we assume that the matrix K j P and vector c j are recalculated for each batch, giving the algorithm an overall complexity of O(M 2 s 2 B s ).This apparently daunting computational complexity can nevertheless be advantageous if N is very large and B N .Teymur et al. ( 2021) recommends a ratio s/B ≈ 10, though this is expected to be problem-dependent.Finding the exact solution of this type of optimisation problem is NPhard5 , however a 'good' feasible solution may still be useful.Indeed, the iterative nature of the overall algorithm allows it to compensate, to a degree, for sub-optimal selection of states at a given iteration through its selection of states in future iterations.Fortunately, 'good' solutions can readily be obtained using any of a number of packaged discrete optimisation routines, such as the commercial software gurobi, MOSEK and MATLAB's Optimization Toolbox, or numerous open-source equivalents.

Summary
This completes our review of post-processing strategies for MCMC, when the aim is to accurately approximate the distributional target P itself.Given that convergence diagnostics and thinning are well-known techniques, we deliberately focussed on their shortcomings in this review.Then, we described recent methodology that aims to directly address the bias-variance trade-off that occurs when post-processing MCMC output.This trade-off is fundamental to many important and challenging applications of MCMC, in which there is a practical limit to the computational budget.To limit scope, we did not discuss alternative classes of algorithm, such as unbiased Monte Carlo (Jacob et al. 2020), for which a biasvariance trade-off is systematically avoided.Finally, it was argued that recent developments in Stein discrepancy have the potential to substantially impact on both applications of, and research into, MCMC.

APPROXIMATION OF POSTERIOR EXPECTATIONS
In contrast to exploratory Bayesian analyses, several applications of Bayesian statistics require just a finite number of scalar posterior quantities of interest.For example, in a decision-making context, the Bayes rule may take an explicit and simple form, such as the mean of the posterior or perhaps a median, or a higher moment (Berger 2013).To proceed, one can first run MCMC, followed by suitable post-processing as described in Section 2, to obtain an approximation to the posterior from which quantities of interest can be extracted.
However, approximating the full posterior may incur unnecessary computational effort.In such circumstances it is natural to seek to focus computational resources on approximating just the quantities of interest.
Control variates are a classical technique for reducing the variance of Monte Carlo estimators, which are used in a wide range of applications, including stochastic gradientbased optimisation (Wang et al. 2013, Grathwohl et al. 2018) and as part of MCMC methods themselves (Baker et al. 2019).In this section we review the use of control variates as a post-processing technique for MCMC.It will be shown that modern control variates, unlike their classical counterparts, can facilitate bias removal, as well as variance reduction.In Section 3.1 the control variate technique is presented at a general level, then in Section 3.2 we present specific control variates techniques and explain how these can be used to postprocess MCMC.

Monte Carlo Estimators
For the purposes of this article, a Monte Carlo estimator is a map µ : Ω × L 2 (P ) → R whose output µ(ω, f ) depends on ω only via dependence on a collection of random variables X1(ω), . . ., Xn(ω).The output, µ(ω, f ), is interpreted as an approximation to the integral f dP , which we consider to be a scalar quantity of interest.Our focus is on Monte Carlo estimators that are based on MCMC output, with the standard example being the estimator that takes an average of f over the states (Xn) n≤N , in the MCMC output.Such an estimator is said to be consistent if, for all f ∈ L 2 (P ), the random variable in Equation ( 12) converges in probability to f dP as N → ∞.The asymptotics of Equation ( 12) are well-studied in the setting where the Markov chain is P -invariant (Meyn & Tweedie 2012).Improved approximations can be obtained using the methods described in Section 2. For example, post-processed MCMC output of the form in Equation ( 1), can be used to provide a Monte Carlo estimator 13.
In the setting where the Markov chain is not P -invariant, Equation ( 12) will be asymptotically biased in general but Equation ( 13) may yet be consistent, as explained in Section 2.3, and therefore Equation ( 13) may be preferred.In the presence of several consistent estimators, it is natural to ask which estimator should be preferred; this question can be rigorously formulated in terms of the mean square error of the estimators and the answer will be f -dependent in general.For convenience we will leave the ω argument implicit in the remainder of this section.
3.1.1.Selecting a Monte Carlo Estimator.The mean square error of a Monte Carlo estimator µ is defined as Presented with a collection {µ θ } θ∈Θ of Monte Carlo estimators, say indexed by θ ∈ Θ, we would like to select an estimator for which MSE(µ θ (f )) is minimised.Let us assume that the mean square error can itself be consistently estimated based on the MCMC output, i.e. we have available an estimator MSE(µ(f )).Then a general recipe to select a Monte Carlo estimator is as follows:
There are at least three possible shortcomings with this general recipe, which will be discussed.First, it is not clear how one should identify an appropriate set of Monte Carlo estimators; control variates provide an elegant solution that we discuss next in Section 3.1.2.Second, it may be a challenging to identify a suitable estimator for the mean squared MSE, since the underlying MCMC method may be complicated.Options for this are discussed in Section 3.1.3.Third, estimation error in MSE presents a challenge when there are many Monte Carlo estimators being compared, since with more estimators there is a greater chance of selecting a poor estimator due to bad luck.A solution to this problem requires that the size of the set of candidate estimators is controlled in some way commensurate with the error in MSE.Several solutions will be discussed in Section 3.2, including restricting the size of this set through the use of explicit finite dimensional bases, and through coupling the size of Θ to the size N of the MCMC output.
It is emphasised that, compared to the techniques reviewed in Section 2, the selection of Monte Carlo estimators remains as much an art as a science.Theoretical analyses are available on some aspects of the general recipe just outlined, and will be highlighted, but to our knowledge there does not yet exist a theoretical treatment that is broadly applicable in the MCMC context.

Constructing Monte Carlo Estimators Using Control
Variates.An element g ∈ L 2 (P ) is said to be a control variate (for P ) if gdP = 0. Clearly any finite linear combination of control variates is also a control variate, and we will use G to denote a linear subspace of L 2 (P ) whose elements are control variates.The power of control variates is that they enable one to take a single Monte Carlo estimator, such as Equation ( 12), and from this to generate a possibly large collection of Monte Carlo estimators.Indeed, armed with a consistent Monte Carlo estimator µ and a set of control variates G, one can consider Monte Carlo estimators of the form µ θ (f The consistency of µ is automatically inherited by each µ θ .
Control Variate: A square-integrable function whose expectation is 0.
Up to this point we have not discussed how control variates can be found in practice.Many approaches for developing control variates in the context of Markov chain sampling are based on approximating the solution f to the typically intractable Poisson equation 15.
where K is the one-step ahead prediction operator In this setting, one could evaluate E[f ] exactly by evaluating f + K f − f .Andradóttir et al. (1993) propose numerical algorithms to approximate this solution in the context of finite state spaces.Henderson (1997) approximates the solution for specific Markov samplers, focusing on continuous-time processes and applications in stochastic network theory.This was extended in Dellaportas & Kontoyiannis (2012) for reversible Markov chains where K is tractable for some basis functions.A method to approximate the solution to the Poisson equation by discretising the state space for geometrically ergodic Metropolis-Hastings chains is introduced in Mijatović & Vogrinc (2018).Control variates have also been built for independent Metropolis-Hastings samplers (Atchadé & Perron 2005) and for general Metropolis-Hastings samplers (Hammer & Tjelmeland 2008), although the latter approach requires an extension of the state space to include proposals.
The aforementioned control variates are sampler-specific or require adjustments to the MCMC algorithm.Section 3.2 describes sampler-agnostic control variates that are applicable when ∇ log p or an unbiased estimate is available.
3.1.3.Proxies for Mean Square Error.The problem of estimating the mean square error of a Monte Carlo estimator is difficult, due to the fact that both the dependence between the states (Xn) n≤N in MCMC output, and the way that these states are combined in the Monte Carlo estimator µ, can be arbitrarily complicated.See, for example, Flegal & Jones (2010) for strategies that can be used to estimate the mean square error of the Monte Carlo estimator in Equation ( 12).To promote generality, here we consider simple and generic proxies for mean square error that are easily computed, and much of what we recommend is based on empirical evidence only.
A simple proxy for mean square error can be obtained by considering Equation ( 13) in the idealised setting where Xi ∼ P , for which it follows with equality when the Xi are independent.The variance Var(f ) can be estimated using the empirical variance evaluated using MCMC output.Empirical variance minimisation for constructing control variates was studied in Belomestny et al. (2021) for the case where the Xi are independent.For non-independent Xi, arising as MCMC output, a more involved proxy based on spectral approximation of the asymptotic variance was studied in Brosse et al. ( 2019), Belomestny et al. (2020a,b), representing probably the most successful attempt to-date to provide theory for control variates for post-processing MCMC output.On the other hand, a popular and simple upper bound on Equation ( 17) is the least squares estimator 18.
An empirical comparison of empirical variance and least squares estimators for the selection of control variates in Si et al. (2020) reported that, perhaps surprisingly, the least squares estimator performed best.That is, one selects is minimised.The scalar integral of interest is then estimated as where we have used the defining optimality property of θ2 to conclude that the summation in Equation ( 19) is zero.One can equivalently describe this estimator as the result of first solving the weighted least squares regression problem for the intercept θ1 ∈ R and the predictor θ2 ∈ G, where the dataset consists of the (random) covariates xi = X π(i) and independent errors i ∼ N (0, w −1 i ), i = 1, . . ., M , then reporting the fitted intercept θ1 as an approximation to the integral of interest.Next we address the question of how a set G of control variates can actually be constructed.

Gradient-Based Control Variates
Perhaps the main challenge in the application of control variate is identifying a suitable linear subspace G.The elements of G should (i) have known expectation under P , (ii) be easy to compute, and (iii) offer an improvement on a Monte Carlo estimator µ that would otherwise have been used, in the sense that MSE(µ θ (f )) < MSE(µ(f )) for some θ ∈ R × G.In this section we discuss gradient-based control variates that often meet these requirements, focussing on domains X = R d for some d ∈ N.These gradient-based control variates are constructed using mathematical tools similar to those exploited in Section 2. The construction of control variates for non-Euclidean domains is discussed in Barp et al. (2021) for closed manifolds, while the general case, including discrete domains, remains under-developed.
Recall the operator AP defined in Equation (3); i.e.AP h = ∇ • h + ∇ log p • h where h : R d → R d .It was shown (in the inset box) that AP h dP = 0 under an appropriate tail condition on h; it therefore is natural to consider a linear subspace of L 2 (P ) consisting of control variates of the form G = AP Φ = {AP φ : φ ∈ Φ} where Φ := {φ : R d → R d } is a linear space of functions for which the aforementioned tail condition is satisfied.The form for gradient-based control variates described here can be traced to the physics literature (Assaraf & Caffarel 1999, 2003), and was first brought to bear on MCMC in Mira et al. (2013).As discussed in Section 2, the required gradients are produced as a by-product when gradient-based samplers, such as the Metropolis-adjusted Langevin algorithm (Roberts & Stramer 2002) or Hamiltonian Monte Carlo (Duane et al. 1987), are used, making the combination of gradient-based sampling and gradient-based control variates particularly appealing (Papamarkou et al. 2014).
It remains to discuss how the set Φ of differentiable vector fields φ : R d → R d can be selected.In what follows we review some of the main choices that previous researchers have considered.
3.2.1.Finite-Dimensional Basis.Perhaps the simplest choice for Φ is the linear span of a finite set {φ1, . . ., φJ }.There is clearly much flexibility in the choice of the vector fields φj, but a popular choice is to use the gradients of monomials.Specifically, the so-called zero-variance control variates (ZVCV) of Assaraf & Caffarel (1999, 2003), Mira et al. (2013) sets Φ to be gradients of the class of r-th order polynomials, Φ = span{∇x α : α ∈ N d 0 , 0 < |α| ≤ r} where r ∈ N, x α = d i=1 x α i i and |α| = d i=1 |αi|.The number of basis functions is therefore J = d+r d and the associated set G of control variates contains elements of the form 21.
Having identified Φ, we can aim to select an optimal control variate from G using one of the proxies for mean square error discussed in Section 3.1.3.Suppose that J < M and consider the least squares proxy in Equation ( 18).In what follows we consider a Monte Carlo estimator of the form in Equation ( 13), which of course contains, as a special case, the vanilla Monte Carlo estimator in Equation ( 12).Then we solve the regression problem in Equation ( 20) to obtain a fitted regression model 22.
where we collect the regression coefficients together into a vector c = ( θ1, θ2,1, . . ., θ2,J ) ∈ R J+1 .For completeness we now provide an explicit formula for the coefficient vector c in the fitted model.Let Then standard calculations show that selecting c to minimise LS(f − f ) leads to the estimated coefficients being ĉ = (Φ W Φ) −1 Φ W f .The integral f dP of interest is approximated by θ1, the first component of c.The Monte Carlo estimator so obtained will be denoted µ ZVCV (f ) = ĉ e1 = θ1 in the sequel.
A intriguing property of gradient-based control variates with finite-dimensional bases is that, under many of the proxies for mean square error that we discussed, the resulting Monte Carlo estimators are semi-exact, in the sense that MSE(µ ZVCV (f )) = 0 when f ∈ span{1} ⊕ AP Φ.
Recalling that for a Gaussian P , the gradient ∇ log p is a first order polynomial, Semi-exact: A Monte Carlo estimator is semi-exact if it is exact on a linear subspace of L 2 (P ).
semi-exactness in this case carries the interpretation of being exact for polynomials up to a certain order when Φ consists of gradients of monomials, in a similar way to how Gaussian cubature methods are constructed.This explains the "zero variance" nomenclature used in Assaraf & Caffarel (1999, 2003), Mira et al. (2013).
The main problem with using a finite-dimensional basis is that the regression problem is typically misspecified, since f / ∈ span{1} ⊕ AP Φ for most functions f of interest.This limits the variance reduction that can be achieved.To improve convergence rates, one could consider increasing the size of Φ with increasing M , in the spirit of Portier & Segers (2018), South et al. (2018), or using an infinite-dimensional basis with regularisation, as described next.
3.2.2.Infinite-Dimensional Basis.Oates et al. (2017) extended the gradient-based control variates of Assaraf & Caffarel (1999, 2003), Mira et al. (2013) to an infinite-dimensional linear subspace of L 2 (P ).This was achieved by taking Φ = H(k) d to be a Cartesian product of reproducing kernel Hilbert spaces H(k) of sufficiently regular functions; see the inset box in Section 2 for background.The resulting set of control variates is G = AP Φ = H(kP ), which is again a reproducing kernel Hilbert space with reproducing kernel kP (x, y) defined in Equation ( 5).The resulting method was referred to as control functionals (CF), being a non-parametric (or 'functional') generalisation of existing control variates.
The major challenge associated with an infinite-dimensional set G of control variates is over-fitting; there may be infinitely many θ ∈ R × G for which MSE(µ θ (f )) = 0, yet in reality MSE(µ θ (f )) may be arbitrarily large.Consider, for instance, the least squares proxy which can be driven to zero by taking θ2 to interpolate f − θ1 at the nodes X π(i) , i = 1, . . ., M .Constraining θ2 at a finite set of locations does not constrain what θ2 may do outside this finite set, and is therefore not sufficient to provide control on MSE(µ θ (f )).The methodological contribution of Oates et al. (2017) was to select, among the set of θ ∈ R × G for which Equation ( 23) is minimised, an element with minimal semi-norm, where the semi-norm on R × G was defined as |θ| 2 = θ2, θ2 H(k P ) .Under regularity assumptions, it can be shown that there exists a unique such element θ ∈ R × G.Moreover, there is a closed-form solution to this optimisation problem which leads to the estimator µ CF (f ) = (1 ).Note that we may without loss of generality assume that the X π(i) are distinct in Equation ( 13), since otherwise we could consider smaller M and modify the weights accordingly.This ensures that the matrix KP is non-singular whenever kP is a genuine reproducing kernel.An interesting feature, and possible weakness, of CF is that the Monte Carlo estimator obtained does not depend on the weights wi appearing in Equation ( 13).
The performance of CF is heavily dependent on the choice of the kernel k.A common choice is to use a radial kernel k, such that k(x, y) depends only on x − y , with examples including the Gaussian, Matérn and rational quadratic kernels (Rasmussen 2003).Typically such kernels will be parametric, with a small number of parameters that must be specified.Oates et al. (2017) recommended using cross-validation to select kernel parameters , wherein a subset of the {X π(i) , i ∈ Itrain}, are used construct the Monte Carlo estimator µ θ (f ) where θ = θ ∈ R × G and performance of this Monte Carlo estimator associated with is measured by the sum of squared errors , where Itest = {1, . . ., M } \ Itrain.One then selects the kernel parameters for which E is minimised.
Under regularity assumptions, CF has theoretical advantages over ZVCV.Oates et al. (2019), Barp et al. (2021) used results from scattered data approximation (Wendland 2004) to prove that, in the uniformly weighted case (i.e.wi = 1 M ), the expected error where here s is the number of (weak) derivatives of the function f whose integral is sought.This indicates that the use of CF for post-processing MCMC output can actually improve the convergence rate of the estimator compared to standard MCMC, provided that the smoothness s of f is commensurate with the dimension d of the domain on which it is defined (i.e.s > d 2 ).CF is thus an example of a method that offers super-√ M convergence.
The main weakness of CF is that its performance can be inferior to ZVCV when the dimension d is high relative to the size M of the dataset; next we discuss how this weakness can be addressed.

Mixed Basis.
To address the poor performance of CF relative to CV in the highdimensional context, South et al. (2021) generalised the approaches discussed in Section 3.2.1 and Section 3.2.2, to consider functional approximations of the form f (x) = θ1 + θ2(x) + J j=1 θ 2,j AP φj(x), 24.
where the parameters θ, consisting of θ1 ∈ R, θ2 ∈ H(kP ) and θ 2 ∈ R J , are about to be specified.Notice that one recovers the same form of approximation used in ZVCV, i.e.Equation ( 22), as the special case where θ2 = 0. Similarly one can show that the same form as CF is recovered when θ 2 = 0. Thus Equation ( 24) represents a strict generalisation of ZVCV and CF, and one may hope to obtain the 'best of both worlds', in terms of the superior performance of ZVCV in high dimensions and the super-√ M convergence of CF.The performance of this hybrid approach depends on how the parameters θ are selected.Following Sard (1949), South et al. (2021) propose to select θ such that the following properties are satisfied: 1. f = f for all f ∈ span{1} ⊕ AP Φ, where Φ = span{φ1, . . ., φJ } 2. LS(f − f ) = 0 3. θ2 minimises θ2 → θ2, θ2 H(k P ) subject to the first two properties being satisfied.
The first property is to ensure semi-exactness and the second is an interpolation requirement.The third property amounts to minimising the semi-norm |θ| = θ2, θ2 H(k P ) and serves to ensure uniqueness of θ and to penalise complexity, similarly to CF.This method is referred to as a semi-exact control functional (SECF) and the closed-form solution for the estimator is µ SECF (f ) = e 1 (Φ K −1 P Φ) −1 Φ K −1 P f .If there are parameters in the kernel k that must be specified, then cross validation can be applied in a similar way to that described in Section 3.2.2.Similarly to CF, a possible weakness of this hybrid approach is that the Monte Carlo estimator obtained does not depend on the weights wi appearing in Equation (13).South et al. (2021) demonstrated that such a hybrid approach can indeed enjoy the advantages of both ZVCV and CF; we illustrate this below in Section 3.2.4.1.Open-source software is available for ZVCV, CF and SECF, via the ZVCV package (South 2020) on the comprehensive R archive network (CRAN).The required input for this package is a set of M samples and the associated evaluations of f (•) and ∇ log p(•).
3.2.4.Practical Considerations.Earlier we alluded to the construction of control variates being more an 'art' than a science; here we provide practical recommendations based on our personal experience using control variates to post-process MCMC.
3.2.4.1.Choosing a Control Variate Method.Choosing between various control variate methods, like ZVCV, CF and SECF, is non-trivial.Cross-validation approaches are computationally expensive and prone to incorrect decisions due the need to reduce the sample size in each fold.It would therefore be helpful to have an understanding of the theoretical properties of different methods.Unfortunately, such theoretical analyses are under-developed at present.Specifically, the theory that does exist tends to involve assumptions that are difficult to verify in practice, if they hold at all.Table 1 summarises the current state of knowledge for the methods that we have discussed.

Bias-correcting:
Capable of removing asymptotic bias in certain biased MCMC algorithms.SECF are bias-correcting will not come as a surprise to the reader in light of the discussion in Section 2.3.1.A perhaps more useful approach to selection of a control variate method is to explore their empirical performance in the context of a synthetic test-bed.

The positive entries in
Example.Here we compare the performance of different control variate methods on a simple toy example that aims to represent the (relatively common) situation in which P is approximately Gaussian, which may hold in applications for which there is a Bernsteinvon-Mises limit.For illustrative purposes, we use a 1-dimensional unit Gaussian distribution with density p(x) = (2π) −1/2 exp(−x 2 /2) and we estimate the posterior expectation of f (x) = 1 + x + x 2 + sin(πx) exp(−x 2 ), for which one can verify f dP = 2.This function f was chosen because the combination of complex behaviour near x = 0 and polynomial behaviour in the tails present challenges for both the parametric and non-parametric methods.
For simplicity we consider an idealised MCMC algorithm that samples Xi independently from P , and we consider the vanilla Monte Carlo (MC) estimator 1 N N i=1 f (Xi) as our starting point; i.e. we seek to reduce the variance of this Monte Carlo estimator using a control variate method.
The results are shown in Figure 5.Here the approximating function f for ZVCV is a second order order polynomial6 , which provides a poor approximation in the region where there are data but provides a good approximation in the tail (Section 3.2.4.1).For CF7 , the interpolant f performs well in regions where there are data, less so in the tail (Section 3.2.4.1).In contract, SECF8 is seen to enjoy the 'best of both worlds', behaving like CF in the region of the data and like ZVCV in the tail (Section 3.2.4.1).Examining the sampling distribution of these estimators through repeated simulation, we observe a remarkable increase in accuracy as a result of post-processing the MCMC output (Section 3.2.4.1).Although the total computing time for the 100 repeated simulations increases from approximately 0.02 seconds for vanilla Monte Carlo integration to 0.11 seconds for ZVCV, 0.17 seconds for CF and 0.19 seconds for SECF, all three control variate methods improve upon the vanilla Monte Carlo estimate in terms of the overall efficiency measured by the product of mean square error and computing time.
3.2.4.2.Computational Cost.For many problems the benefit provided by control variates is not justified when the computational cost of implementation is taken into account (see Table 1).However, when the cost of obtaining MCMC output, or the cost of evaluating f on MCMC output, is sufficiently high then control variates can be a useful tool.For borderline cases, Si et al. (2020) demonstrated the use of stochastic gradient descent to speed up the optimisation in Step 3 of the general recipe to select a Monte Carlo estimator.A reduced-cost SECF method, based on a low-rank Nyström approximation, was also proposed in South et al. (2021).
3.2.4.3.Curse of Dimension.The gradient-based control variates that we discussed suffer from a curse of dimension, which is most evident in kernel methods like CF.However, the regression perspective in Equation ( 20) suggests that, by analogy with high-dimensional regression modelling (Bühlmann & Van De Geer 2011), it may be possible to construct control variates for functions f whose effective dimension is small, despite a high ambient dimension of X .Additional regularisation can be introduced to this effect (South et al. 2018, Wan et al. 2019), with positive results reported for d ≤ 100.For even larger d, it may be sensible to pursue nonlinear approximation (DeVore 1998), where the basis Φ is restricted to allow dependence only on a subset of the parameters (so-called a priori regularisation in South et al. 2018).

Summary
This section focused on the application of gradient-based control variates to approximate an integral of interest based on output from MCMC.Applications to other sampling algorithms, such as population MCMC (Oates et al. 2016), stochastic gradient Langevin dynamics (Baker et al. 2019), sequential Monte Carlo (South et al. 2018) and unbiased MCMC with couplings (South et al. 2019), have also been considered and much of our discussion applies unchanged.Applications to estimation of the normalising constant of the posterior have also been considered in the population MCMC and sequential Monte Carlo sampler settings (Oates et al. 2016, South et al. 2018).Again, the extension is straightforward and consists of applying the ideas from this section to improve multiple expectations.The ZVCV package (South 2020) on CRAN provides functions to apply ZVCV and CF to two estimators of the normalising constant.
A current weakness of control variate methodology is that it is under-developed from a theoretical perspective; our focus was on sets of control variates that form linear subspaces of L 2 (P ), for which some limited theoretical understanding has been achieved, but more sophisticated sets of control variates have also been empirically considered.For example, Wan et al. (2019), Si et al. (2020) proposed to use the gradients of neural network for the set Φ. A neural network is parameterised by a collection of weights and biases, which are jointly estimated using stochastic gradient descent applied to a proxy for mean square error, as discussed in Section 3.1.3.These authors found empirically that this approach can lead to improved performance over methods like ZVCV and CF in the high dimensional context.In light of the anticipated technical complexity required to analyse such sophisticated control variate methods, we expect that empirical assessment will continue to be the primary means through which control variate methodology is developed and assessed.

DISCUSSION
MCMC has become a core part of most graduate programmes in Statistics, due to its effectiveness in enabling Bayesian analyses to be performed.Perhaps understandably, these programmes focus on the design and validity of algorithms, emphasising the elegant probabilistic arguments that are often involved.However, this leaves little or no time to discuss post-processing of MCMC output.In fact, our impression is that many professional users of MCMC are also not aware of this aspect, beyond convergence diagnostics and burn-in removal.Through writing this review, we hope greater attention may be given to this under-appreciated but important practical side of MCMC.In particular, the topic is receiving considerable attention from computational researchers at the time of writing, and we extend an invitation to the interested reader to explore further into the recent works cited.

DISCLOSURE STATEMENT
Aside from being authors of some of the literature that was discussed, the authors are not aware of any affiliations, memberships, funding, or financial holdings that might be perceived as affecting the objectivity of this review.

Figure 2 :
Figure 2: Convergence diagnostics for the MCMC output shown in Figure 1.Here we show the traditional R statistic of Brooks & Gelman (1998) (BG; black solid lines) and also an autocorrelation based diagnostic used by Vats & Knudson (2018) (VR; blue and red solid lines), each as a function of the number of iterations N of the Markov chain that are considered.These diagnostics were applied separately to the first and second coordinates of the bivariate state variable (left and central panels) and jointly to both coordinates (right panel).Dashed lines indicate thresholds at which convergence is deemed to have occurred.In all cases, the traditional R statistic does not fall below the thresholds, indicating that convergence has not occurred.

Figure 3 :
Figure3: Post-processing of MCMC output via Stein thinning.Here a Markov chain sample path (gray line) is post-processed to select M = 12 representative states (black circles), such that the discrete measure supported on these representative states provides an accurate approximation to the same distributional target P as in Figure1, indicated by the shaded contour plot in the background.
). Approaches based on Stein discrepancy minimisation include black-box importance sampling (Liu & Lee 2017, Hodgkinson et al. 2020), Stein points (Chen et al. 2018, 2019), and Stein thinning (Riabiz et al. 2021).In what follows we describe the Stein thinning approach of Riabiz et al. (2021), illustrated in Figure 3 in the setting with equal weights wi = 1 M , and defer to Liu & Lee (2017), Hodgkinson et al. (2020), Riabiz et al. (2021) for discussion of the case in which weights are not equal.

Figure 4 :
Figure 4: Extensions to Stein thinning: Here the sample path (gray line) in Figure 3 is postprocessed to select M = 12 representative states (black circles), to provide an approximation to the distributional target P , indicated by the shaded contour plot in the background.Left: Myopic selection of states (look ahead horizon s = 1).Centre: Non-myopic selection of states, with a look-ahead horizon of s = 4. Right: Non-myopic selection of states, with a look-ahead horizon of s = 12.[Integers indicate the iteration of the algorithm in which a given state was selected.] Figure 5: Gradient-based control variates in a toy example.Figures 5a to 5c show the function f of interest (black line) along with the values f (Xi) computed at the random locations (Xi) i≤N , N = 20.These data are used to construct approximations f (red line) to f , in each of the methods ZVCV, CF and SECF.Section 3.2.4.1 shows boxplots of 100 independent estimates for the integral f dP of interest.

Table 1 :
Table 1 should be interpreted as there being (possibly strong) theoretical assumptions under which the result has been established.The fact that CF and Properties of the control variate methods we have discussed.